Openmp #439
Conversation
Is there a benchmark showing how much speedup it brings compared to a multi-threaded BLAS implementation? For many layers (e.g. convolution), having a multithreaded BLAS is already fast, and explicit OpenMP parallelization makes the code more complex with little improvement.
I have already tried the openmp version on …
I do convolutional Forward() and Backward() in parallel for different images in the batch. As a result the CPU code is much faster now: on a desktop I saw a 3.5x speed-up, and on a server with 16 cores ~10x. Still, it is not as fast as GPU. The changes I made in im2col and col2im were not related to OpenMP: I removed the division by modifying the for (c...) loop, and added an early exit for boundary cases.
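A hedged illustration of the "removed division" tweak mentioned above (the function names are hypothetical, not Caffe's actual im2col code): instead of decomposing a flat index with division/modulo inside the hot loop, iterate the channel and kernel components directly. Both functions produce the same (channel, kh, kw) triples; only the index arithmetic differs.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Baseline: recover (channel, kh, kw) from a flat index with div/mod,
// as the pre-patch loop structure effectively does.
std::vector<std::tuple<int, int, int>> decompose_divmod(int channels, int k) {
  std::vector<std::tuple<int, int, int>> out;
  for (int c = 0; c < channels * k * k; ++c)
    out.emplace_back(c / (k * k), (c / k) % k, c % k);
  return out;
}

// Division-free variant: iterate the three components as nested loops.
std::vector<std::tuple<int, int, int>> decompose_nested(int channels, int k) {
  std::vector<std::tuple<int, int, int>> out;
  for (int ch = 0; ch < channels; ++ch)
    for (int kh = 0; kh < k; ++kh)
      for (int kw = 0; kw < k; ++kw)
        out.emplace_back(ch, kh, kw);
  return out;
}
```

Removing per-iteration integer division from an inner loop of this size is a classic micro-optimization; it is independent of the OpenMP change, as the comment notes.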
On benchmarks: I used cifar10 and imagenet training as benchmarks. I compared OpenMP with the current dev version (CPU). I used 2 machines for testing: a desktop (Ivy Bridge, 1 socket x 4 cores, no HT).
I saw your code. You do parallel on …
Forward() can be done independently for each image n, but I had to replicate the buffer for im2col to avoid collisions between threads. For Backward() I replicated the buffer for weight_diff, and summed them all after all threads finish their jobs.
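A minimal sketch of the buffer-replication scheme described above (function and variable names like `accumulate_weight_diff` and `weight_diff_mt` are illustrative, not Caffe's actual code): each thread accumulates gradients for its images into a private slice of a replicated weight_diff buffer, and the slices are summed once the parallel region ends.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallback so the sketch also builds without -fopenmp.
static int omp_get_max_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Each thread writes only to its own slice of weight_diff_mt, so parallel
// images never race; the per-thread slices are reduced afterwards.
std::vector<float> accumulate_weight_diff(const std::vector<float>& per_image_grad,
                                          int num_images, int weight_count) {
  const int nthreads = omp_get_max_threads();
  std::vector<float> weight_diff_mt(
      static_cast<std::size_t>(nthreads) * weight_count, 0.f);
  #pragma omp parallel for
  for (int n = 0; n < num_images; ++n) {
    float* my_diff = &weight_diff_mt[
        static_cast<std::size_t>(omp_get_thread_num()) * weight_count];
    for (int w = 0; w < weight_count; ++w)
      my_diff[w] += per_image_grad[static_cast<std::size_t>(n) * weight_count + w];
  }
  // Sum the replicas into the final gradient after all threads finish.
  std::vector<float> weight_diff(weight_count, 0.f);
  for (int t = 0; t < nthreads; ++t)
    for (int w = 0; w < weight_count; ++w)
      weight_diff[w] += weight_diff_mt[static_cast<std::size_t>(t) * weight_count + w];
  return weight_diff;
}
```

The trade-off is exactly the one debated in this thread: replication avoids locks and atomics, but multiplies the buffer footprint by the thread count.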
I tested the openmp version with MKL and found some issues which I would like to investigate.
Fixed the Makefile to support OpenMP with MKL. Current benchmark results:
Fixed OpenMP + MKL. Tested on imagenet (200 train iterations). The difference between CPU (2-socket server, Xeon E5-2680) and GPU (K20) is 1.5x (CPU is slower); for cifar10, CPU is faster.
To reproduce the results with MKL, you should: 1. Get a free non-commercial version here: https://software.intel.com/en-us/non-commercial-software-development
Good job! Will take a look.
col_buffer_mt_.resize(num_of_threads_ *
    channels_ * kernel_size_ * kernel_size_ * height_out * width_out);
weight_diff_mt_.resize(num_of_threads_ *
    num_output_ * (channels_ / group_) * kernel_size_ * kernel_size_);
As has also been pointed out in the discussion, I am worried about the intermediate data size: for example, for an input of size 55*55*256 and a 3*3 kernel (I am just making up numbers; they may not correspond to actual imagenet layers) with stride-1 convolution, a single buffer will have size
55 * 55 * 3 * 3 * 256 * sizeof(float) ≈ 28MB.
With multiple threads (like 10) this will grow to around 280MB, and with multiple convolutional layers it may quickly grow to gigabytes. That's why I feel that we should rely on a multithreaded BLAS to speed things up rather than having a single-threaded BLAS and explicit OpenMP code.
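The arithmetic behind the ~28MB figure can be checked directly (a sketch; `im2col_buffer_bytes` is a hypothetical helper, and it assumes 4-byte floats):

```cpp
#include <cassert>
#include <cstddef>

// One im2col buffer for a convolution whose output is h_out x w_out,
// with `channels` input channels and a k x k kernel: every output pixel
// stores a full channels*k*k patch.
std::size_t im2col_buffer_bytes(std::size_t h_out, std::size_t w_out,
                                std::size_t channels, std::size_t k) {
  return h_out * w_out * channels * k * k * sizeof(float);
}
```

For the 55x55x256 input with a 3x3 kernel at stride 1 quoted above, this gives 27,878,400 bytes, i.e. roughly the 28MB per buffer (before any per-thread replication) that the comment estimates.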
I tried to run caffe with the large OverFeat net (http://cilvr.nyu.edu/doku.php?id=software:overfeat:start) on a K20, but the model did not fit in the K20's DRAM. I was able to train it on CPU. The footprint for 16 threads was ~16 GB(!). The training speed was ~5 images/sec. I used a server with 32 GB and 2x E5-2670 @ 2.6 GHz.
MEMORY: Total memory overhead is relatively small. I ran imagenet training on my desktop with OMP_NUM_THREADS=4 and never saw memory utilization above 5 GB. Since the default desktop configuration these days is 8 GB, this overhead does not seem to be an issue. Even on servers, when I set OMP_NUM_THREADS=24, the size was still around 5-5.5 GB.
Your benchmarks are very impressive, thanks for reporting these numbers! I'll hopefully be able to give this a try at some point, and would definitely be in favor of merging this if I see speed/memory numbers anywhere near what you report (perhaps at first into a new branch off dev, to make it more easily available while doing further testing and refinement if necessary). I'm pretty busy over the next few weeks though; maybe somebody else will beat me to it.
These are the setup details:
Test: imagenet, 100 train iterations (batch = 256).
Hi @borisgin, sorry it's taken me forever to get around to this. Have you tested at all with OpenBLAS, or only MKL? I tried this out with OpenBLAS and didn't see performance improve on a machine with 32 … Here's what I did: initially I was using the system OpenBLAS library, but I saw a bunch of error messages when I ran the training telling me to recompile OpenBLAS with the option … Using your branch, these are the results I see for imagenet training (…):
Around 26 seconds per iteration. Then I reran with …
That's ~29 seconds per iteration. I'm using an 8-core machine. Do you think I could have done something wrong with the setup, and/or do you expect this will only work well with MKL? I can retry it with MKL if you think that's the problem. (I notice that when I open "top" while running with just 1 OMP thread, I see CPU utilization often >1000%, so it seems the BLAS library is already doing quite a bit of parallelization, as @Yangqing suggested.)
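Since the >1000% CPU reading suggests two layers of parallelism fighting each other, one hedged way to disentangle them (OMP_NUM_THREADS and OPENBLAS_NUM_THREADS are the standard OpenMP/OpenBLAS variables; the core count here is illustrative) is to pin BLAS to a single thread and let the explicit OpenMP code own the cores:

```shell
# Let this branch's OpenMP parallelism use the cores, and disable
# OpenBLAS-internal threading so the two do not oversubscribe each other.
export OMP_NUM_THREADS=8        # e.g. one thread per physical core
export OPENBLAS_NUM_THREADS=1   # pin the BLAS to a single thread
# ...then run the training command as usual.
```

Reversing the split (OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS set to the core count) measures the multithreaded-BLAS baseline that @Yangqing advocates, making the two approaches directly comparable.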
I tested borisgin's improvement; it does speed things up on CPU, but it doesn't scale as well as he reported. Our result is 3x+ for 16 cores. @jeffdonahue, are all 8 cores on your machine physical cores?
Ah, nice catch -- there are actually only 2 physical cores on the machine I was using. I had looked at /proc/cpuinfo and misinterpreted the result (even knowing that it is easily misinterpretable)... it prints 32 "processors" with "cpu cores: 8", but in fact the "physical id" is only 0 or 1, so I guess I have 2 physical cores. I will try again on a more powerful machine, sorry.
@jeffdonahue it means you have 2 physical CPUs, each CPU has 8 cores, and each core has 2 hyper-threads, so you get 32 threads (processors).
grep 'physical id' /proc/cpuinfo | sort -u
grep 'core id' /proc/cpuinfo | sort -u | wc -l
grep 'processor' /proc/cpuinfo | sort -u | wc -l
dmidecode -s processor-version
Check that the OpenBLAS compile command used OpenMP, or set the thread count with "void openblas_set_num_threads(int num_threads);". By default, with USE_OPENMP=1 it will use all the threads, as said by @xianyi.
@jeffdonahue your machine has 2 CPUs, and each CPU has 8 cores, so the number of physical cores is 16 and the number of processors (hyper-threads) is 32. Thus, you can set OMP_NUM_THREADS=16. But what is your batch size? When using OpenMP, the overhead of creating threads is not trivial; you cannot use a very small batch size.
Hi, @jeffdonahue, I did not test the OpenMP + OpenBLAS combination. I tested with SMT (hyper-threading) disabled in BIOS, since it does not help matrix multiplication at all. Did you rebuild OpenBLAS for your CPU? I have a server with 2 sockets x E5-2680; each CPU has 8 cores/8 threads at 2.7 GHz. I ran imagenet training with batch size = 256 for 100 iterations.
Hi, @borisgin,
First I used what I got with "sudo apt-get install libopenblas-base".
Hi, @borisgin, the Intel Xeon E5-2680 is the Sandy Bridge arch, not Haswell. For Sandy Bridge, OpenBLAS 0.2.9 rolled back the sgemm kernel to the old Core2 kernel. In the latest 0.2.10 version, we enabled an optimized sgemm kernel for Sandy Bridge. Could you retest the performance with the new OpenBLAS version?
I got a new desktop with a Haswell CPU, so I wanted to rebuild the code to check how the new AVX2 instructions (FMA) impact performance. I will retest the performance on it for OpenBLAS vs (OpenBLAS + OpenMP).
Hi, @xianyi,
Rebased openmp with latest dev branch:
I found a curious performance bug, related to the alternative implementations of MKL functions which are lacking in OpenBLAS. I did a quick fix in lrn_layer.cpp. @xianyi, any plans to add a fast powd to OpenBLAS? Thanks, Boris
caffe with OpenBLAS:
I1117 14:03:46.019815 17778 caffe.cpp:268] norm2 backward: 692.64 ms.
caffe with MKL:
I1119 10:58:23.674092 29606 caffe.cpp:268] norm2 backward: 47.9396 ms.
caffe with MKL and OpenMP:
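A hedged sketch of the kind of quick fix described above (the function name `powx_omp` is illustrative, not the actual lrn_layer.cpp change): when a vectorized elementwise power routine such as MKL's vsPowx is unavailable, fall back to scalar std::pow parallelized across elements with OpenMP.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Elementwise y[i] = a[i]^b. Without a vectorized BLAS/VML powx, the scalar
// loop dominates LRN backward time; splitting it across threads recovers
// much of the gap measured above.
void powx_omp(int n, const float* a, float b, float* y) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] = std::pow(a[i], b);
}
```

This is a stopgap: a proper SIMD-vectorized pow in OpenBLAS (the feature request mentioned below) would be faster still, since per-element libm calls do not vectorize.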
@borisgin, I already added a feature request for this function.
Is this PR (or something similar) going to be merged soon? When I checked not too long ago, CPU Caffe was unnecessarily slow without OpenMP. I'm debating tacking my own OpenMP things together... but merging this PR with present-day Caffe would be optimal.
I wouldn't hold my breath. This PR is more than a year old and it doesn't seem the maintainers think it should be merged. (Probably better to close it.)
While CPU execution can be further optimized, this PR is closed since it is against the deprecated dev branch. The branch was not merged at the time due to concerns about further complexity and dependencies. Thanks for your work @borisgin.
@shelhamer @bhack
Hi, I was trying to build your openmp version on a CentOS 6.5 computer and I got a protobuf version error.
Hi Crefeda,
@Crefeda If I recall correctly, caffe relies on some features that were introduced in protobuf 2.5, so I think 2.5 and above should work.
Parallel version of caffe for CPU based on OpenMP. Significant speed-up on CPU; scales well with the number of cores (3x for 4 cores, 10x for 16 cores). Modified files: convolutional, pooling and relu layers, im2col and col2im.